1. Credit Card Applications - Intro and Importing the Data

Commercial banks receive a lot of applications for credit cards. Many of them get rejected for many reasons, like high loan balances, low income levels, or too many inquiries on an individual's credit report, for example. Manually analyzing these applications is mundane, error-prone, and time-consuming (and time is money!). Luckily, this task can be automated with the power of machine learning and pretty much every commercial bank does so nowadays. In this notebook, I will build an automatic credit card approval predictor using machine learning techniques, just like the real banks do.

image.png

Credit card being held in hand

We'll use the Credit Card Approval dataset from the UCI Machine Learning Repository. The structure of this notebook is as follows:

First, I will start off by loading and viewing the dataset. I will see that the dataset has a mixture of both numerical and non-numerical features, that it contains values from different ranges, plus that it contains a number of missing entries. I will have to preprocess the dataset to ensure the machine learning model we choose utilise the data and so it can make good predictions. After our data is in good shape, I will do some exploratory data analysis to build our intuitions. Finally, I will build a machine learning model that can predict if an individual's application for a credit card will be accepted.

Firstly, after downloading and viewing the dataset, dataset has been anonymized to protect the client data therefore I have assigned some common credit rating features that could fit the data.

At the end of the notebook I will analysis which features are the most predicitve this is a commong process in all predictive models (and very relevant in banking and insurance). Models want to have the right balance of being predictive but not having too many variables as to make the application process too time consuming.

In [31]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Load dataset
cc_apps = pd.read_csv(r'C:\Users\Adam\Desktop\Personal\Python Projects\Github\4) Credit Card Approvals\Credit Card Approval Data.txt', comment='#', header=None)

# Inspect data
print(cc_apps.head())

cc_apps.rename(columns={0:'gender',1:'Age',2:'Debt',3:'Married',4: 'Bank_Customer', 5:'Education_Level', \
                        6 :'Ethnicity', 7 :'Years_Employed', 8: 'prior_defualt',9:'Employed', 10: 'CreditScore', \
                        11: 'Drivers_Licence', 12: 'Citizen', 13 :'Zipcode', 14: 'Income', 15: 'Approved'}, inplace=True)

print("\n\n")

print(cc_apps.head())
  0      1      2  3  4  5  6     7  8  9   10 11 12     13   14 15
0  b  30.83  0.000  u  g  w  v  1.25  t  t   1  f  g  00202    0  +
1  a  58.67  4.460  u  g  q  h  3.04  t  t   6  f  g  00043  560  +
2  a  24.50  0.500  u  g  q  h  1.50  t  f   0  f  g  00280  824  +
3  b  27.83  1.540  u  g  w  v  3.75  t  t   5  t  g  00100    3  +
4  b  20.17  5.625  u  g  w  v  1.71  t  f   0  f  s  00120    0  +



  gender    Age   Debt Married Bank_Customer Education_Level Ethnicity  \
0      b  30.83  0.000       u             g               w         v   
1      a  58.67  4.460       u             g               q         h   
2      a  24.50  0.500       u             g               q         h   
3      b  27.83  1.540       u             g               w         v   
4      b  20.17  5.625       u             g               w         v   

   Years_Employed prior_defualt Employed  CreditScore Drivers_Licence Citizen  \
0            1.25             t        t            1               f       g   
1            3.04             t        t            6               f       g   
2            1.50             t        f            0               f       g   
3            3.75             t        t            5               t       g   
4            1.71             t        f            0               f       s   

  Zipcode  Income Approved  
0   00202       0        +  
1   00043     560        +  
2   00280     824        +  
3   00100       3        +  
4   00120       0        +  

2. Inspecting the applications

As can be see from our first glance at the data, the dataset has a mixture of numerical and non-numerical features. This can be fixed with some preprocessing, but before we do that, let's learn about the dataset a bit more to see if there are other data issues that need to be fixed.

In [32]:
# Print summary statistics
cc_apps_description = cc_apps.describe()
print(cc_apps_description)

print("\n")

# Print DataFrame information
cc_apps_info = cc_apps.info()
print(cc_apps_info)

print("\n")

# from inspecting the data a little further it is clear there are some ? values that represent missing values
print(cc_apps.tail(30))
             Debt  Years_Employed  CreditScore         Income
count  690.000000      690.000000    690.00000     690.000000
mean     4.758725        2.223406      2.40000    1017.385507
std      4.978163        3.346513      4.86294    5210.102598
min      0.000000        0.000000      0.00000       0.000000
25%      1.000000        0.165000      0.00000       0.000000
50%      2.750000        1.000000      0.00000       5.000000
75%      7.207500        2.625000      3.00000     395.500000
max     28.000000       28.500000     67.00000  100000.000000


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   gender           690 non-null    object 
 1   Age              690 non-null    object 
 2   Debt             690 non-null    float64
 3   Married          690 non-null    object 
 4   Bank_Customer    690 non-null    object 
 5   Education_Level  690 non-null    object 
 6   Ethnicity        690 non-null    object 
 7   Years_Employed   690 non-null    float64
 8   prior_defualt    690 non-null    object 
 9   Employed         690 non-null    object 
 10  CreditScore      690 non-null    int64  
 11  Drivers_Licence  690 non-null    object 
 12  Citizen          690 non-null    object 
 13  Zipcode          690 non-null    object 
 14  Income           690 non-null    int64  
 15  Approved         690 non-null    object 
dtypes: float64(2), int64(2), object(12)
memory usage: 86.4+ KB
None


    gender    Age    Debt Married Bank_Customer Education_Level Ethnicity  \
660      b  22.25   9.000       u             g              aa         v   
661      b  29.83   3.500       u             g               c         v   
662      a  23.50   1.500       u             g               w         v   
663      b  32.08   4.000       y             p              cc         v   
664      b  31.08   1.500       y             p               w         v   
665      b  31.83   0.040       y             p               m         v   
666      a  21.75  11.750       u             g               c         v   
667      a  17.92   0.540       u             g               c         v   
668      b  30.33   0.500       u             g               d         h   
669      b  51.83   2.040       y             p              ff        ff   
670      b  47.17   5.835       u             g               w         v   
671      b  25.83  12.835       u             g              cc         v   
672      a  50.25   0.835       u             g              aa         v   
673      ?  29.50   2.000       y             p               e         h   
674      a  37.33   2.500       u             g               i         h   
675      a  41.58   1.040       u             g              aa         v   
676      a  30.58  10.665       u             g               q         h   
677      b  19.42   7.250       u             g               m         v   
678      a  17.92  10.210       u             g              ff        ff   
679      a  20.08   1.250       u             g               c         v   
680      b  19.50   0.290       u             g               k         v   
681      b  27.83   1.000       y             p               d         h   
682      b  17.08   3.290       u             g               i         v   
683      b  36.42   0.750       y             p               d         v   
684      b  40.58   3.290       u             g               m         v   
685      b  21.08  10.085       y             p               e         h   
686      a  22.67   0.750       u             g               c         v   
687      a  25.25  13.500       y             p              ff        ff   
688      b  17.92   0.205       u             g              aa         v   
689      b  35.00   3.375       u             g               c         h   

     Years_Employed prior_defualt Employed  CreditScore Drivers_Licence  \
660           0.085             f        f            0               f   
661           0.165             f        f            0               f   
662           0.875             f        f            0               t   
663           1.500             f        f            0               t   
664           0.040             f        f            0               f   
665           0.040             f        f            0               f   
666           0.250             f        f            0               t   
667           1.750             f        t            1               t   
668           0.085             f        f            0               t   
669           1.500             f        f            0               f   
670           5.500             f        f            0               f   
671           0.500             f        f            0               f   
672           0.500             f        f            0               t   
673           2.000             f        f            0               f   
674           0.210             f        f            0               f   
675           0.665             f        f            0               f   
676           0.085             f        t           12               t   
677           0.040             f        t            1               f   
678           0.000             f        f            0               f   
679           0.000             f        f            0               f   
680           0.290             f        f            0               f   
681           3.000             f        f            0               f   
682           0.335             f        f            0               t   
683           0.585             f        f            0               f   
684           3.500             f        f            0               t   
685           1.250             f        f            0               f   
686           2.000             f        t            2               t   
687           2.000             f        t            1               t   
688           0.040             f        f            0               f   
689           8.290             f        f            0               t   

    Citizen Zipcode  Income Approved  
660       g   00000       0        -  
661       g   00216       0        -  
662       g   00160       0        -  
663       g   00120       0        -  
664       s   00160       0        -  
665       g   00000       0        -  
666       g   00180       0        -  
667       g   00080       5        -  
668       s   00252       0        -  
669       g   00120       1        -  
670       g   00465     150        -  
671       g   00000       2        -  
672       g   00240     117        -  
673       g   00256      17        -  
674       g   00260     246        -  
675       g   00240     237        -  
676       g   00129       3        -  
677       g   00100       1        -  
678       g   00000      50        -  
679       g   00000       0        -  
680       g   00280     364        -  
681       g   00176     537        -  
682       g   00140       2        -  
683       g   00240       3        -  
684       s   00400       0        -  
685       g   00260       0        -  
686       g   00200     394        -  
687       g   00200       1        -  
688       g   00280     750        -  
689       g   00000       0        -  

3. Handling the missing values (part i)

Looking into the data I've uncovered some issues that will affect the performance of our machine learning model(s) if they go unchanged:

Our dataset contains both numeric and non-numeric data (specifically data that is of float64, int64 and object types). Specifically, the features 2, 7, 10 and 14 contain numeric values (of types float64, float64, int64 and int64 respectively) and all the other features contain non-numeric values.

The dataset also contains values from several ranges. Some features have a value range of 0 - 28, some have a range of 2 - 67, and some have a range of 1017 - 100000. Examining the data we can also get useful statistical information (like mean, max, and min) about the features that have numerical values. Finally, the dataset has missing values, which I'll take care of in this task. The missing values in the dataset are labeled with '?', which can be seen in the last cell's output.

For now, I will temporarily replace these missing value question marks with NaN.

In [33]:
cc_apps.replace('?', np.NaN, inplace=True)

print(cc_apps.tail(30))
    gender    Age    Debt Married Bank_Customer Education_Level Ethnicity  \
660      b  22.25   9.000       u             g              aa         v   
661      b  29.83   3.500       u             g               c         v   
662      a  23.50   1.500       u             g               w         v   
663      b  32.08   4.000       y             p              cc         v   
664      b  31.08   1.500       y             p               w         v   
665      b  31.83   0.040       y             p               m         v   
666      a  21.75  11.750       u             g               c         v   
667      a  17.92   0.540       u             g               c         v   
668      b  30.33   0.500       u             g               d         h   
669      b  51.83   2.040       y             p              ff        ff   
670      b  47.17   5.835       u             g               w         v   
671      b  25.83  12.835       u             g              cc         v   
672      a  50.25   0.835       u             g              aa         v   
673    NaN  29.50   2.000       y             p               e         h   
674      a  37.33   2.500       u             g               i         h   
675      a  41.58   1.040       u             g              aa         v   
676      a  30.58  10.665       u             g               q         h   
677      b  19.42   7.250       u             g               m         v   
678      a  17.92  10.210       u             g              ff        ff   
679      a  20.08   1.250       u             g               c         v   
680      b  19.50   0.290       u             g               k         v   
681      b  27.83   1.000       y             p               d         h   
682      b  17.08   3.290       u             g               i         v   
683      b  36.42   0.750       y             p               d         v   
684      b  40.58   3.290       u             g               m         v   
685      b  21.08  10.085       y             p               e         h   
686      a  22.67   0.750       u             g               c         v   
687      a  25.25  13.500       y             p              ff        ff   
688      b  17.92   0.205       u             g              aa         v   
689      b  35.00   3.375       u             g               c         h   

     Years_Employed prior_defualt Employed  CreditScore Drivers_Licence  \
660           0.085             f        f            0               f   
661           0.165             f        f            0               f   
662           0.875             f        f            0               t   
663           1.500             f        f            0               t   
664           0.040             f        f            0               f   
665           0.040             f        f            0               f   
666           0.250             f        f            0               t   
667           1.750             f        t            1               t   
668           0.085             f        f            0               t   
669           1.500             f        f            0               f   
670           5.500             f        f            0               f   
671           0.500             f        f            0               f   
672           0.500             f        f            0               t   
673           2.000             f        f            0               f   
674           0.210             f        f            0               f   
675           0.665             f        f            0               f   
676           0.085             f        t           12               t   
677           0.040             f        t            1               f   
678           0.000             f        f            0               f   
679           0.000             f        f            0               f   
680           0.290             f        f            0               f   
681           3.000             f        f            0               f   
682           0.335             f        f            0               t   
683           0.585             f        f            0               f   
684           3.500             f        f            0               t   
685           1.250             f        f            0               f   
686           2.000             f        t            2               t   
687           2.000             f        t            1               t   
688           0.040             f        f            0               f   
689           8.290             f        f            0               t   

    Citizen Zipcode  Income Approved  
660       g   00000       0        -  
661       g   00216       0        -  
662       g   00160       0        -  
663       g   00120       0        -  
664       s   00160       0        -  
665       g   00000       0        -  
666       g   00180       0        -  
667       g   00080       5        -  
668       s   00252       0        -  
669       g   00120       1        -  
670       g   00465     150        -  
671       g   00000       2        -  
672       g   00240     117        -  
673       g   00256      17        -  
674       g   00260     246        -  
675       g   00240     237        -  
676       g   00129       3        -  
677       g   00100       1        -  
678       g   00000      50        -  
679       g   00000       0        -  
680       g   00280     364        -  
681       g   00176     537        -  
682       g   00140       2        -  
683       g   00240       3        -  
684       s   00400       0        -  
685       g   00260       0        -  
686       g   00200     394        -  
687       g   00200       1        -  
688       g   00280     750        -  
689       g   00000       0        -  

4. Handling the missing values (part ii - imputing data - mean)

Dropping missing values can affect the performance of a machine learning model heavily. While ignoring the missing values our machine learning model may miss out on information about the dataset that may be useful for its training.

Therefore for columns with float or integer vlaues I am going to impute the means of the columns for the NaN values

In [34]:
#due to age originally containing some ?s it was stored as an pandas object not a float so we must convert it to fill NaNs with the mean

cc_apps.Age = cc_apps.Age.astype(float)

# Impute the missing values with mean imputation
cc_apps.fillna(cc_apps.mean(), inplace=True)

#now it can be seen that all int and float onjects have no NaN values
cc_apps.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   gender           678 non-null    object 
 1   Age              690 non-null    float64
 2   Debt             690 non-null    float64
 3   Married          684 non-null    object 
 4   Bank_Customer    684 non-null    object 
 5   Education_Level  681 non-null    object 
 6   Ethnicity        681 non-null    object 
 7   Years_Employed   690 non-null    float64
 8   prior_defualt    690 non-null    object 
 9   Employed         690 non-null    object 
 10  CreditScore      690 non-null    int64  
 11  Drivers_Licence  690 non-null    object 
 12  Citizen          690 non-null    object 
 13  Zipcode          677 non-null    object 
 14  Income           690 non-null    int64  
 15  Approved         690 non-null    object 
dtypes: float64(3), int64(2), object(11)
memory usage: 86.4+ KB

5. Handling the missing values (part iii - imputing data - Most Frequent Value)

Having successfully taken care of the missing values present in the numeric columns. There are still some missing values to be imputed for columns 0, 1, 3, 4, 5, 6 and 13. All of these columns contain non-numeric data and this why the mean imputation strategy would not work here. This needs a different treatment.

I am going to impute these missing values with the most frequent values as present in the respective columns. This is good practice when it comes to imputing missing values for categorical data in general.

In [35]:
# Iterate over each column of cc_apps
for col in cc_apps.columns :
    # Check if the column is of object type
    if cc_apps[col].dtype == 'object':
        # Impute with the most frequent value
        cc_apps = cc_apps.fillna(cc_apps[col].value_counts().index[0])

# Count the number of NaNs in the dataset and print the counts to verify

print(cc_apps.info())

print('\n')

print(cc_apps[cc_apps == 'NaN'].count())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   gender           690 non-null    object 
 1   Age              690 non-null    float64
 2   Debt             690 non-null    float64
 3   Married          690 non-null    object 
 4   Bank_Customer    690 non-null    object 
 5   Education_Level  690 non-null    object 
 6   Ethnicity        690 non-null    object 
 7   Years_Employed   690 non-null    float64
 8   prior_defualt    690 non-null    object 
 9   Employed         690 non-null    object 
 10  CreditScore      690 non-null    int64  
 11  Drivers_Licence  690 non-null    object 
 12  Citizen          690 non-null    object 
 13  Zipcode          690 non-null    object 
 14  Income           690 non-null    int64  
 15  Approved         690 non-null    object 
dtypes: float64(3), int64(2), object(11)
memory usage: 86.4+ KB
None


gender             0
Age                0
Debt               0
Married            0
Bank_Customer      0
Education_Level    0
Ethnicity          0
Years_Employed     0
prior_defualt      0
Employed           0
CreditScore        0
Drivers_Licence    0
Citizen            0
Zipcode            0
Income             0
Approved           0
dtype: int64
C:\Users\Adam\anaconda3\lib\site-packages\pandas\core\ops\array_ops.py:253: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  res_values = method(rvalues)

6. Preprocessing the data (part i)

The missing values are now successfully handled.

There is still some minor but essential data preprocessing needed before we proceed towards building our machine learning model. We are going to divide these remaining preprocessing steps into three main tasks:

Convert the non-numeric data into numeric. Split the data into train and test sets. Scale the feature values to a uniform range.

First, I will be converting all the non-numeric values into numeric ones. I do this because not only does it results in a faster computation but also many machine learning models (like XGBoost) (and especially the ones developed using scikit-learn) require the data to be in a strictly numeric format. I will do this by using a technique called label encoding.

In [36]:
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.svm import SVC
from sklearn.preprocessing import scale
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler


# Import LabelEncoder
from sklearn.preprocessing import LabelEncoder

# Instantiate LabelEncoder
LE = LabelEncoder()

# Iterate over all the values of each column and extract their dtypes
for col in cc_apps.columns:
    # Compare if the dtype is object
    if cc_apps[col].dtype =='object':
    # Use LabelEncoder to do the numeric transformation
        cc_apps[col]=LE.fit_transform(cc_apps[col])
        
# as you can see all object type columns have been changed into label encoded values        
print(cc_apps.head())
   gender    Age   Debt  Married  Bank_Customer  Education_Level  Ethnicity  \
0       1  30.83  0.000        2              1               13          8   
1       0  58.67  4.460        2              1               11          4   
2       0  24.50  0.500        2              1               11          4   
3       1  27.83  1.540        2              1               13          8   
4       1  20.17  5.625        2              1               13          8   

   Years_Employed  prior_defualt  Employed  CreditScore  Drivers_Licence  \
0            1.25              1         1            1                0   
1            3.04              1         1            6                0   
2            1.50              1         0            0                0   
3            3.75              1         1            5                1   
4            1.71              1         0            0                0   

   Citizen  Zipcode  Income  Approved  
0        0       68       0         0  
1        0       11     560         0  
2        0       96     824         0  
3        0       31       3         0  
4        2       37       0         0  

7. Splitting the dataset into train and test sets

Having successfully converted all the non-numeric values to numeric ones.

Now, I will split our data into train set and test set to prepare our data for two different phases of machine learning modeling: training and testing. Ideally, no information from the test data should be used to scale the training data or should be used to direct the training process of a machine learning model. Hence, I will first split the data and then apply the scaling.

Also, features like DriversLicense are not as important as the other features in the dataset for predicting credit card approvals. Therfore I will drop it, as to design our machine learning model with the best set of features. In Data Science literature, this is often referred to as feature selection.

In [37]:
# Drop the the driver licence column and convert the DataFrame to a NumPy array
cc_apps = cc_apps.drop('Drivers_Licence',axis=1)

cc_appsVals = cc_apps.values

# Segregate features and labels into separate variables
X,y = cc_appsVals[:,0:14] , cc_appsVals[:,14].reshape(-1)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X,
                                y,
                                test_size=0.30,
                                random_state=47)

8. Preprocessing the data (part ii)

The data is now split into two separate sets - train and test sets respectively. I am only left with one final preprocessing step of scaling before we can fit a machine learning model to the data.

Now, let's try to understand what these scaled values mean in the real world. Let's use CreditScore as an example. The credit score of a person is their creditworthiness based on their credit history. The higher this number, the more financially trustworthy a person is considered to be. So, a CreditScore of 1 is the highest since I'm rescaling all the values to the range of 0-1.

In [38]:
# Instantiate MinMaxScaler and use it to rescale X_train and X_test
scaler = MinMaxScaler(feature_range=(0,1))

rescaledX_train = scaler.fit_transform(X_train)

rescaledX_test = scaler.fit_transform(X_test)

print(rescaledX_train[0], rescaledX_test[0])

print(rescaledX_train.shape,rescaledX_test.shape)
[0.         0.27699248 0.05214286 0.66666667 0.33333333 0.92857143
 0.88888889 0.03807018 1.         1.         0.23880597 0.
 0.21764706 0.02079   ] [1.         0.40702703 0.00837154 0.66666667 0.33333333 0.64285714
 0.88888889 0.00425    1.         0.         0.         0.
 0.56470588 0.        ]
(483, 14) (207, 14)

9. Fitting a logistic regression model to the train set

Essentially, predicting if a credit card application will be approved or not is a classification task. According to UCI, the dataset contains more instances that correspond to "Denied" status than instances corresponding to "Approved" status. Specifically, out of 690 instances, there are 383 (55.5%) applications that got denied and 307 (44.5%) applications that got approved.

This provides the bmetrics for comparing the models performance. A good machine learning model should be able to accurately predict the status of the applications with respect to these statistics.

Which model should I pick though? The question to ask here is: are the features that affect the credit card approval decision process correlated with each other? Generalized linear models perform well in these cases. Let's start our machine learning modeling with a Logistic Regression model (a generalized linear model).

In [39]:
# Instantiate a LogisticRegression classifier with default parameter values

logreg = LogisticRegression()

model = logreg.fit(rescaledX_train,y_train)

10. Making predictions and evaluating performance

But how well does my model perform?

I will now evaluate our model with respect to classification accuracy. But we will also take a look the model's confusion matrix. In the case of predicting credit card applications, it is equally important to see if our machine learning model is able to predict the approved applications as well as as denied applications. If my model is not performing well in this aspect, then it might end up rejecting an application that should have been approved.

The classification report helps us to view our model's performance from these aspects, by providing us with the precision and recall results for predicting the approvals and rejections.

We can also plot an ROC curve as a way to visualise the predictive performance of the model

In [40]:
from sklearn.metrics import roc_curve, roc_auc_score, precision_recall_curve, auc, make_scorer, recall_score, accuracy_score, precision_score, confusion_matrix, balanced_accuracy_score

# Use logreg to predict instances from the test set and store it
y_pred = model.predict(rescaledX_test)

# Get the accuracy score of logreg model and print it
print("Accuracy of logistic regression classifier: ", model.score(rescaledX_test,y_test))

# Print the classification of the logreg model
print(classification_report(y_test, y_pred))


# Plotting an ROC curve to examine the predicitve capabilities of each algorithmn
# keep probabilities for the positive outcome only
y_predPOSONLY = model.predict_proba(rescaledX_test)[:,1].astype(float)
            
#using this tells you which order the classes are in
logreg.classes_
            
#generate a no skill model for comparison that always guesses 1 
ns_pred = [1 for _ in range(len(y_test))]

#Calculate ROC scores
ns_auc = roc_auc_score(y_test,ns_pred)
y_pred_auc = roc_auc_score(y_test,y_predPOSONLY)

print ('\n' +'No skill model ROC score: {}'.format(ns_auc))           

print ('\n' +'LogReg model ROC score: {}'.format(y_pred_auc)) 


# calculate roc curves
ns_fpr, ns_tpr, _ = roc_curve(y_test, ns_pred)
y_fpr, y_tpr, _ = roc_curve(y_test, y_predPOSONLY)

fig1, ax = plt.subplots(figsize=(10,6))
# plot the roc curve for the model
ax.plot(ns_fpr, ns_tpr, linestyle='--', label='No Skill')
ax.plot(y_fpr, y_tpr, marker='.', label='LogReg')
# axis labels
ax.set_xlabel('False Positive Rate')
ax.set_ylabel('True Positive Rate')
            
ax.set_title("LogReg predictive performance for Credit Card Approvals")
# show the legend
ax.legend()
# show the plot
plt.show()
Accuracy of logistic regression classifier:  0.855072463768116
              precision    recall  f1-score   support

         0.0       0.77      0.92      0.84        85
         1.0       0.93      0.81      0.87       122

    accuracy                           0.86       207
   macro avg       0.85      0.86      0.85       207
weighted avg       0.87      0.86      0.86       207


No skill model ROC score: 0.5

LogReg model ROC score: 0.9214079074252651

11. Grid searching and making the model perform better

The model is pretty good! It was able to yield an accuracy score of almost 84% and an ROC score of 0.92%!

Let's see if I can do better though!. I can perform a grid search of the model parameters to improve the model's ability to predict credit card approvals.

Scikit-learn's implementation of logistic regression consists of different hyperparameters but we will grid search over the following two:

tol max_iter

In [24]:
# Define the grid of values for tol and max_iter
tol = [0.01, 0.001, 0.0001]
max_iter = [100, 150, 200]

# Create a dictionary where tol and max_iter are keys and the lists of their values are corresponding values
param_grid = {'tol' : tol, 'max_iter': max_iter}

12. Comparing the hyptertuned performance to the standard

I have defined the grid of hyperparameter values and converted them into a single dictionary format which GridSearchCV() expects as one of its parameters. Now, we will begin the grid search to see which values perform best.

I will instantiate GridSearchCV() with our earlier logreg model with all the data we have. Instead of passing train and test sets separately, I will supply X (scaled version) and y. I will also instruct GridSearchCV() to perform a cross-validation of five folds.

What you can see is by hypertuning the model we are able to improve the accuracy score from 84 to 85%!

In [41]:
# Instantiate GridSearchCV with the required parameters
grid_model = GridSearchCV(estimator=logreg, param_grid=param_grid, cv=5)

# Use scaler to rescale X and assign it to rescaledX
rescaledX = scaler.fit_transform(X)

# Fit data to grid_model
grid_model_result = grid_model.fit(rescaledX, y)

# Summarize results
best_score, best_params = grid_model_result.best_score_ , grid_model_result.best_params_
print("Best: %f using %s" % (best_score, best_params))
Best: 0.850725 using {'max_iter': 100, 'tol': 0.01}

13. Analysing which features have the most preditive power.

After fitting the hypetertuned model we can asses the coffecients of the model.

Regregression models have the form y=Ax1 + Bx2 + Cx3.... where in this case eacah x represents a feature fed into our model ( Age, married etc...) and we can extract the paramters for each model (A,B, C)

a large postitive paramater for a Factor sugests a high value in that factor corresponds to being more likely to be accepted. E.g. the having a high score for marriage (being married) means you are more likely to be accepted

where as having a high score for previous default means your more likely to be rejected (as it has a large negative parameter!)

In [48]:
reg_coeffs = model.coef_.reshape(-1,)

""" we can plot the coefficients """

# Plot the coefficients
plt.plot(reg_coeffs)
plt.xticks(range(len(cc_apps.columns)), cc_apps.columns, rotation=60)
plt.margins(0.02)
plt.show()

Wrapping up!

While building this credit card application outcome predictor, I tackled some of the most widely-known preprocessing steps such as scaling, label encoding, and missing value imputation. I finished with some machine learning to predict if a person's application for a credit card would get approved and how to assess the most predicitve variables in a mahcine learning model!

In [ ]: